On convergence rates of Bayesian predictive densities and posterior distributions






Ryan Martin 

Department of Mathematics, Statistics, and Computer Science 
University of Illinois at Chicago 



rgmartin@math.uic.edu



Liang Hong 
Department of Mathematics 
Bradley University 



lhong@bradley.edu



October 2, 2012 



Abstract 

Frequentist-style large-sample properties of Bayesian posterior distributions, 
such as consistency and convergence rates, are important considerations in non- 
parametric problems. In this paper we give an analysis of Bayesian asymptotics 
based primarily on predictive densities. Our analysis is unified in the sense that 
essentially the same approach can be taken to develop convergence rate results in 
iid, mis-specified iid, independent non-iid, and dependent data cases. 

Keywords and phrases: Density estimation; Hellinger distance; Kullback-Leibler 
divergence; Markov process; nonparametric; separation. 

1 Introduction

In Bayesian nonparametric problems, asymptotic concentration properties of the pos- 
terior distribution are often key to motivating a particular choice of prior. Indeed, 
for infinite-dimensional problems, elicitation of subjective priors is difficult and a theory
of objective priors remains elusive, so large-sample properties of the posterior
are often what drives the choice of prior. A desirable (frequentist) property can be 
summarized roughly as follows: for a given "true model" and prior, as more and more 
data becomes available, the posterior distribution becomes more and more concentrated 
around this true model with large probability. Early efforts along these lines are given
in Doob (1949) and Schwartz (1965). Stronger results, some including rates of convergence,
are presented in Barron et al. (1999), Ghosal et al. (1999), Ghosal et al. (2000),
Shen and Wasserman (2001), Ghosal and van der Vaart (2001, 2007b), Tokdar (2006),
and Walker et al. (2007). Modern efforts include extensions to non-Euclidean sample
spaces (Bhattacharya and Dunson 2010), more complex models (Pati et al. 2011), and
misspecified models (Kleijn and van der Vaart 2006; Lian 2009; Shalizi 2009).

This paper presents a sort of unified analysis of Bayesian posterior convergence rates 
based on predictive densities. These predictive densities are fundamental quantities in 
Bayesian statistical inference — they are the Bayes estimates of the density under a variety 
of loss functions. This connection with predictive densities is not new, but the extent to 
which we depend on these quantities gives our analysis a strong Bayesian flavor. More- 
over, we show how essentially the same techniques can be used to develop convergence 
rate theorems for a variety of models, including iid, mis-specified iid, independent non-iid, 
and dependent data. In particular, we prove (apparently) new Cesaro-style convergence 
rate results for predictive densities, in each of the four contexts above, under weaker 
conditions than usual for posterior convergence rate theorems. We also develop a fundamental
lemma, also based on predictive densities, which helps bound the numerator of
the posterior probability for sets not too close to the true data-generating density. This
result is similar to Proposition 1 in Walker et al. (2007), but the proof is different and
applies almost word-for-word in a variety of contexts. It also relies on an interesting notion
of separation of points and sets, apparently due to Choi and Ramamoorthi (2008). This
lemma is then used to prove posterior convergence rate theorems.

The remainder of the paper is organized as follows. Section 2 develops the notation and
terminology used throughout the paper, in particular, the notion of prior thickness at the
true data-generating density, and separation of sets from this same density. Sections 3
and 4 cover predictive density and posterior convergence rates, respectively, both in
the simplest iid context. In Section 4 we prove an auxiliary result that demonstrates that
the sieve + covering style conditions in Ghosal et al. (2000) are weaker than the prior
summability conditions in Walker et al. (2007). The results on predictive density and
posterior convergence rate theorems are extended to the mis-specified iid, independent
non-iid, and dependent cases in Sections 5-7, respectively. Finally, some concluding
remarks are given in Section 8.



2 Bayesian nonparametrics 
2.1 Notation and definitions 

Let $\mathbb{Y}$ be a Polish space equipped with its Borel $\sigma$-algebra $\mathscr{Y}$. Suppose
$Y_{1:n} = (Y_1, \ldots, Y_n)$, $n \geq 1$, are independent $\mathbb{Y}$-valued observations with
common distribution $F$, and that $F$ has a density $f = dF/d\mu$ with respect to some
$\sigma$-finite measure $\mu$ on $(\mathbb{Y}, \mathscr{Y})$. Let $\mathbb{F}$ be the set of all such
densities $f$ and $\mathscr{F}$ its Borel $\sigma$-algebra. Then a prior distribution $\Pi$ for
$f$ is a probability measure on the measurable space $(\mathbb{F}, \mathscr{F})$. Following Bayes'
theorem, the posterior distribution of $f$, given $Y_{1:n}$, can be written as

$$\Pi_n(A) = \Pi(A \mid Y_{1:n}) = \frac{\int_A \prod_{i=1}^n f(Y_i)\,\Pi(df)}{\int_{\mathbb{F}} \prod_{i=1}^n f(Y_i)\,\Pi(df)}, \quad A \in \mathscr{F}. \tag{1}$$

We require a topology on $\mathbb{F}$, and here we focus on the Hellinger topology. The Hellinger
distance $H$ is given by $H(f, f') = \{\int (f^{1/2} - f'^{1/2})^2 \, d\mu\}^{1/2}$, for $f, f' \in \mathbb{F}$.
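
As a concrete illustration of the Hellinger distance just defined, the following minimal sketch evaluates $H$ for densities on a finite sample space, so that $\mu$ is counting measure and the integral is a sum. The function name `hellinger` and the example vectors are illustrative assumptions, not part of the paper.

```python
import math


def hellinger(f, g):
    """Hellinger distance between two probability vectors on a common finite
    sample space, i.e. densities with respect to counting measure."""
    if len(f) != len(g):
        raise ValueError("densities must be defined on the same sample space")
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(f, g)))


# Hypothetical example densities on a three-point sample space.
f = [0.5, 0.3, 0.2]
g = [0.2, 0.3, 0.5]

print(hellinger(f, f))                    # identical densities: distance zero
print(hellinger(f, g))                    # strictly between 0 and sqrt(2)
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # disjoint supports: the maximum, sqrt(2)
```

With this normalization $H$ takes values in $[0, \sqrt{2}]$, which is why the modification $h = H^2/2$ used later takes values in $[0, 1]$.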

For describing large-sample properties of the posterior, it is standard to assume that
there is a "true density" $f^*$ from which the data $Y_1, \ldots, Y_n$ are observed. It shall be
required that the prior $\Pi$ puts a sufficient amount of mass around this $f^*$; the precise
conditions on $\Pi$ are stated in Section 2.2. With "true density" $f^*$, it is typical to rewrite
the posterior (1) as

$$\Pi_n(A) = \frac{\int_A R_n(f)\,\Pi(df)}{\int_{\mathbb{F}} R_n(f)\,\Pi(df)}, \quad A \in \mathscr{F}, \tag{2}$$

where $R_0(f) \equiv 1$ and

$$R_n(f) = \prod_{i=1}^n f(Y_i)/f^*(Y_i), \quad n \geq 1. \tag{3}$$



In what follows, we will occasionally refer to the posterior $\Pi_n$, restricted to a given set
$A$. By that we mean the measure $\Pi_n^A$ defined as $\Pi_n^A(\cdot) = \Pi_n(A \cap \cdot)/\Pi_n(A)$.
Also, $\lesssim$ and $\gtrsim$ will denote inequality up to a universal constant.

The convergence rate of the posterior $\Pi_n$ concerns the amount of probability assigned to
(expanding) sets that do not contain the true density $f^*$ when $n$ is large. Let $(\varepsilon_n)$ be
a positive vanishing sequence. Then the posterior $\Pi_n$ has a Hellinger convergence rate
$\varepsilon_n$ if $\Pi_n(\{f : H(f^*, f) > \varepsilon_n\}) \to 0$ in probability. Here, and in what follows, the
"in probability" qualification is with respect to $P = P_{f^*}^{\infty}$, the product distribution,
under $f^*$, of the infinite data sequence $Y_{1:\infty} = (Y_1, Y_2, \ldots)$.



2.2 Prior support conditions 

In order for the posterior distribution to concentrate around $f^*$, some support conditions
on the prior $\Pi$ are needed. For example, if there exists a set $A \ni f^*$ such that $\Pi(A) = 0$,
then, trivially, the posterior cannot concentrate around $f^*$. To avoid these kinds of
degeneracies, it is typical to assume that $\Pi$ satisfies the Kullback-Leibler property, i.e.,
that $\Pi(\{f : K(f^*, f) < \varepsilon\}) > 0$ for all $\varepsilon > 0$, where
$K(f^*, f) = \int \log(f^*/f)\, f^* \, d\mu$ is the Kullback-Leibler divergence of $f$ from $f^*$.
See Wu and Ghosal (2008, 2010) for sufficient conditions and a host of examples. While the
Kullback-Leibler property itself is not a necessary condition for posterior convergence, it
does make up part of an important and useful set of sufficient conditions, developed by
Schwartz (1965). Indeed, the Kullback-Leibler property alone is enough to imply that the
posterior is weakly consistent. But extra conditions, beyond the Kullback-Leibler property,
are generally needed to establish strong consistency; see Choi and Ramamoorthi (2008).

To establish rates of convergence, something even stronger than the usual Kullback-Leibler
property is needed. Set $V(f^*, f) = \int \{\log(f^*/f)\}^2 f^* \, d\mu$.



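Definition 1. Let $(\varepsilon_n)$ be a positive sequence such that $\varepsilon_n \to 0$ and
$n\varepsilon_n^2 \to \infty$. The prior $\Pi$ is said to be $\varepsilon_n$-thick at $f^*$ if, for some
constant $C > 0$,

$$\Pi(\{f : K(f^*, f) \leq \varepsilon_n^2,\; V(f^*, f) \leq \varepsilon_n^2\}) \geq e^{-Cn\varepsilon_n^2}. \tag{4}$$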



This is exactly condition (2.4) in Ghosal et al. (2000), which they motivate with entropy
considerations. Since (4) is stronger than the Kullback-Leibler property, prior
thickness can be seen as a support condition on the prior, guaranteeing that the prior
assigns a sufficient amount of mass near $f^*$. Beyond this intuition, the following technical
lemma, giving a lower bound on the denominator in (2), is a consequence of prior
thickness. See Ghosal et al. (2000, Lemma 8.1 and the proof of Theorem 2.1).
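
To make the thickness condition concrete, the sketch below computes $K$, $V$, and the prior mass of the neighborhood appearing in the thickness condition for a toy setting: a finite sample space and a prior with finitely many atoms. Everything named here (`kl`, `v2`, `thickness_mass`, the atoms and weights) is a hypothetical illustration; real nonparametric priors are not finitely supported, so this only shows what quantity the condition bounds from below.

```python
import math


def kl(f_star, f):
    """Kullback-Leibler divergence K(f*, f) = sum f* log(f*/f)."""
    return sum(p * math.log(p / q) for p, q in zip(f_star, f) if p > 0)


def v2(f_star, f):
    """Second moment V(f*, f) = sum f* {log(f*/f)}^2."""
    return sum(p * math.log(p / q) ** 2 for p, q in zip(f_star, f) if p > 0)


def thickness_mass(f_star, atoms, weights, eps):
    """Prior mass of {f : K(f*, f) <= eps^2 and V(f*, f) <= eps^2}
    when the prior puts mass weights[k] on the density atoms[k]."""
    return sum(
        w
        for f, w in zip(atoms, weights)
        if kl(f_star, f) <= eps ** 2 and v2(f_star, f) <= eps ** 2
    )


# Hypothetical two-point sample space and three-atom prior.
f_star = [0.5, 0.5]
atoms = [[0.5, 0.5], [0.6, 0.4], [0.9, 0.1]]
weights = [0.2, 0.3, 0.5]

for eps in (0.25, 0.1):
    print(eps, thickness_mass(f_star, atoms, weights, eps))
```

Thickness asks that this mass be at least $e^{-Cn\varepsilon_n^2}$ along the sequence $(\varepsilon_n)$; in the toy example the mass shrinks from 0.5 to 0.2 as the neighborhood tightens.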



Lemma 1. Let $I_n = \int R_n(f)\,\Pi(df)$ be the denominator in (2). If $\Pi$ is $\varepsilon_n$-thick
at $f^*$, then $P(I_n \leq e^{-cn\varepsilon_n^2}) \to 0$ for any $c > C + 1$ with $C$ as in (4).



Next is a simple application of Lemma 1, similar to Proposition 4.4.2 in Ghosh and
Ramamoorthi (2003), that will be used in the proof of the main results.

Lemma 2. Assume $\Pi$ is $\varepsilon_n$-thick at $f^*$. For a sequence $(U_n) \subset \mathscr{F}$, suppose
that $\Pi(U_n) \leq e^{-rn\varepsilon_n^2}$, where $r > C + 1$, with $C$ as in (4). Then
$\Pi_n(U_n) \to 0$ in probability.

Proof. Write $\Pi_n(U_n) = L_n/I_n$. Using Markov's inequality and the assumption on $\Pi(U_n)$,
it is easy to check that $P(L_n > e^{-cn\varepsilon_n^2}) \leq e^{-(r-c)n\varepsilon_n^2}$. Therefore, if
$c \in (C + 1, r)$, then $P(e^{cn\varepsilon_n^2} L_n > \eta) \to 0$ for any $\eta > 0$. Similarly, from
Lemma 1, for the same $c$, $P(I_n \leq e^{-cn\varepsilon_n^2}) \to 0$. Then by the law of total
probability,

$$\begin{aligned}
P\{\Pi_n(U_n) > \eta\} &= P(L_n/I_n > \eta) \\
&= P(L_n/I_n > \eta,\, I_n \leq e^{-cn\varepsilon_n^2}) + P(L_n/I_n > \eta,\, I_n > e^{-cn\varepsilon_n^2}) \\
&\leq P(I_n \leq e^{-cn\varepsilon_n^2}) + P(L_n/I_n > \eta,\, I_n > e^{-cn\varepsilon_n^2}) \\
&\leq P(I_n \leq e^{-cn\varepsilon_n^2}) + P(e^{cn\varepsilon_n^2} L_n > \eta).
\end{aligned} \tag{5}$$

Both quantities on the right-hand side vanish with $n$, so $\Pi_n(U_n) \to 0$ in probability. □

2.3 Convexity and separation 

Choi and Ramamoorthi (2008) make use of two important properties for subsets $A$ of $\mathbb{F}$.
Here we define and discuss these properties.

Definition 2. A set $A \subset \mathbb{F}$ is convex if, for any probability measure $\Phi$ supported
on $A$, the expectation, $f_\Phi = \int f \, \Phi(df)$, also belongs to $A$.

Examples of convex subsets of $\mathbb{F}$ include balls, i.e., all those $f$ within a specified
distance from a center $f_0$. For an important example, let $h = H^2/2$, a slight modification
of the squared Hellinger distance. Choose a point $f_0 \in \mathbb{F}$ and let
$A = \{f : h(f_0, f) \leq r\}$. Now, take any probability measure $\Phi$ supported on $A$. Then by
convexity of $h$ and definition of $A$, we have

$$h(f_0, f_\Phi) \leq \int_A h(f_0, f)\,\Phi(df) \leq r.$$

Therefore, $f_\Phi$ is in $A$ and, hence, $A$ is convex. In the applications that follow, the
probability measure $\Phi$ will often be a truncated version of the posterior distribution.

Definition 3. A density $f^* \in \mathbb{F}$ and a set $A \subset \mathbb{F}$ are $\delta$-separated
(with respect to $h$) if $h(f^*, f) > \delta$ for all $f$ in $A$.

For an important example, choose $r > 0$ and $f_0$ such that $H(f^*, f_0) > r$. Then $f^*$
and $A = \{f : H(f_0, f) < r/2\}$ are $\delta$-separated, with $\delta = r^2/8$. To see this, note that
the triangle inequality implies

$$H(f^*, f) \geq H(f^*, f_0) - H(f_0, f), \quad \forall\, f \in A.$$

From the definitions of $f_0$ and $A$, the right-hand side is strictly greater than $r/2$.
Therefore, $h(f^*, f) > r^2/8$, so $f^*$ and $A$ are $(r^2/8)$-separated (with respect to $h$).
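
The separation example can be checked numerically. The sketch below, on a hypothetical three-point sample space, enumerates a grid of densities inside the Hellinger ball of radius $r/2$ around $f_0$ and confirms that each is more than $r^2/8$ away from $f^*$ in $h$. The particular $f^*$, $f_0$, $r$, and the grid resolution are assumptions chosen for illustration only.

```python
import math


def hellinger(f, g):
    """Hellinger distance for probability vectors (counting measure)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(f, g)))


def h(f, g):
    """Modified squared Hellinger distance h = H^2 / 2."""
    return hellinger(f, g) ** 2 / 2


def simplex_grid(m):
    """All probability vectors on three points whose entries are multiples of 1/m."""
    return [
        (i / m, j / m, (m - i - j) / m)
        for i in range(m + 1)
        for j in range(m + 1 - i)
    ]


# Hypothetical true density, ball center, and radius with H(f*, f_0) > r.
f_star = (0.5, 0.3, 0.2)
f_0 = (0.1, 0.2, 0.7)
r = 0.5
assert hellinger(f_star, f_0) > r

# Grid points inside the ball A = {f : H(f_0, f) < r/2}.
ball = [f for f in simplex_grid(20) if hellinger(f_0, f) < r / 2]
worst = min(h(f_star, f) for f in ball)

print(len(ball), "grid densities in the ball")
print("smallest h(f*, f) over the ball:", worst)
print("separation threshold r^2/8:", r ** 2 / 8)
```

A grid check is of course not a proof; the triangle inequality argument in the text is what guarantees the bound for every $f$ in the ball.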

In the applications that follow, we shall extend this idea in two directions. First, in
some cases, we need separation with respect to distances other than Hellinger distance $H$
(or $h$). Second, we shall consider sequences of sets $(A_n)$ and sequences of numbers $(\delta_n)$.
Then the notion of $\delta_n$-separation of a density $f^*$ and sets $A_n$ is straightforward.



3 Convergence rates for predictive densities 

Predictive densities are fundamental quantities in Bayesian analysis. Indeed, they are
the Bayes density estimators under a variety of different loss functions. In particular, the
predictive density of $Y_{n+1}$ given $Y_1, \ldots, Y_n$ is

$$\hat f_n(y) = \int f(y)\,\Pi_n(df),$$

the posterior expectation of $f(y)$. For example, in a density estimation problem with
Hellinger distance as the loss function, the predictive density $\hat f_n$ is the Bayes estimator
of $f$ in the sense that it minimizes Bayes risk.
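
As a small illustration of the predictive density as a posterior expectation, the sketch below handles the simplest possible case: a prior with finitely many atoms on a finite sample space, so both the posterior and the predictive density reduce to weighted sums. The helper names and the example prior are hypothetical and only meant to mirror the formula above.

```python
import math


def posterior_weights(atoms, prior, data):
    """Posterior weights: prior weight times likelihood, renormalized."""
    unnorm = [w * math.prod(f[y] for y in data) for f, w in zip(atoms, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def predictive(atoms, prior, data):
    """Predictive density of the next observation: posterior mean of f(y)."""
    post = posterior_weights(atoms, prior, data)
    return [sum(w * f[y] for f, w in zip(atoms, post)) for y in range(len(atoms[0]))]


# Hypothetical prior: two candidate densities on {0, 1}, equally weighted.
atoms = [[0.5, 0.5], [0.9, 0.1]]
prior = [0.5, 0.5]

print(predictive(atoms, prior, []))         # no data: the prior mean density
print(predictive(atoms, prior, [0, 0, 0]))  # three zeros shift mass toward the second atom
```

With no data the predictive density is the prior mean; after observing three zeros it moves toward the atom that favors zero.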

Our first result develops a Kullback-Leibler convergence rate for predictive densities
in a Cesaro sense. The proof is based on calculations in Barron (1987).

Proposition 1. For a given vanishing sequence $(\varepsilon_n)$, let
$K_n = \{f : K(f^*, f) \leq \varepsilon_n^2\}$. If $\log \Pi(K_n) \gtrsim -n\varepsilon_n^2$, then
$n^{-1} \sum_{i=1}^n E\{K(f^*, \hat f_{i-1})\} \lesssim \varepsilon_n^2$.

Proof. Let $f^{*n}$ denote the joint density for an iid sample $(Y_1, \ldots, Y_n)$, i.e., the
$n$-fold product of $f^*$. Likewise, let $\hat f^n$ denote the joint density of $(Y_1, \ldots, Y_n)$
under the Bayesian model with prior $\Pi$, i.e., $\hat f^n = \int f^n \,\Pi(df)$. Since densities are
non-negative,

$$\int f^n \,\Pi(df) \geq \int_{K_n} f^n \,\Pi(df) = \Pi(K_n) \int f^n \,\Pi^{K_n}(df),$$

where $\Pi^{K_n}$ is the prior $\Pi$ restricted and normalized to $K_n$. Therefore, if we define
$\Pi(K_n)\hat f^{n,K_n}$ as the lower bound above, then

$$\begin{aligned}
n^{-1} K(f^{*n}, \hat f^n) &\leq n^{-1}\{K(f^{*n}, \hat f^{n,K_n}) - \log \Pi(K_n)\} \\
&\leq n^{-1} \int K(f^{*n}, f^n)\,\Pi^{K_n}(df) - n^{-1}\log \Pi(K_n),
\end{aligned}$$

where the last inequality is by convexity of $K$. Recall the chain rule for the
Kullback-Leibler number between product densities: $K(f^{*n}, f^n) = nK(f^*, f)$. Therefore,

$$n^{-1} K(f^{*n}, \hat f^n) \leq \int K(f^*, f)\,\Pi^{K_n}(df) - n^{-1}\log \Pi(K_n).$$

By definition of $K_n$, the first term in the upper bound above is $\leq \varepsilon_n^2$, and, by the
assumption on $\Pi(K_n)$, the second term is $\lesssim \varepsilon_n^2$.

To complete the proof, we must connect $n^{-1}K(f^{*n}, \hat f^n)$ and the average in the
statement of the proposition. For this we show that $\hat f^n(Y_1, \ldots, Y_n)$ factors as a
product $\prod_{i=1}^n \hat f_{i-1}(Y_i)$ of predictive densities. The key is

$$\begin{aligned}
\int \prod_{i=1}^n f(Y_i)\,\Pi(df) &= \int f(Y_n) \prod_{i=1}^{n-1} f(Y_i)\,\Pi(df) \\
&= \int f(Y_n)\,\Pi_{n-1}(df) \cdot \int \prod_{i=1}^{n-1} f(Y_i)\,\Pi(df).
\end{aligned}$$

The first term in the last line is the expectation of $f(Y_n)$ with respect to the posterior
distribution $\Pi_{n-1}$, which is exactly $\hat f_{n-1}(Y_n)$; the second term is the normalizing
constant for $\Pi_{n-1}$. The next step is to apply the same trick to the normalizing constant.
That is, write it as an expectation of $f(Y_{n-1})$ with respect to the posterior distribution
$\Pi_{n-2}$ times a new normalizing constant. This gives

$$\hat f_{n-1}(Y_n)\,\hat f_{n-2}(Y_{n-1}) \int \prod_{i=1}^{n-2} f(Y_i)\,\Pi(df).$$

Continuing like this, we find that $\hat f^n(Y_1, \ldots, Y_n)$ factors as
$\prod_{i=1}^n \hat f_{i-1}(Y_i)$. Now,

$$\begin{aligned}
K(f^{*n}, \hat f^n) &= E \log\{f^{*n}(Y_1, \ldots, Y_n)/\hat f^n(Y_1, \ldots, Y_n)\} \\
&= \sum_{i=1}^n E \log\{f^*(Y_i)/\hat f_{i-1}(Y_i)\} \\
&= \sum_{i=1}^n E\bigl[E\{\log(f^*(Y_i)/\hat f_{i-1}(Y_i)) \mid Y_1, \ldots, Y_{i-1}\}\bigr].
\end{aligned}$$

The conditional expectation is $K(f^*, \hat f_{i-1})$, so $K(f^{*n}, \hat f^n)$ equals
$\sum_{i=1}^n E\{K(f^*, \hat f_{i-1})\}$. This, together with the $\lesssim$ bound on
$n^{-1}K(f^{*n}, \hat f^n)$, completes the proof. □
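
The factorization of the marginal likelihood into one-step-ahead predictive densities, used in the proof above, is easy to verify numerically in a toy model. The sketch below assumes a finitely supported prior on a three-point sample space; the helper names, atoms, weights, and data are all hypothetical.

```python
import math


def marginal(atoms, prior, data):
    """Marginal likelihood: integral of prod_i f(Y_i) against the prior."""
    return sum(w * math.prod(f[y] for y in data) for f, w in zip(atoms, prior))


def posterior_weights(atoms, prior, data):
    """Posterior weights given the data: prior times likelihood, renormalized."""
    unnorm = [w * math.prod(f[y] for y in data) for f, w in zip(atoms, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def predictive_at(atoms, prior, past, y):
    """Predictive density at y given the past: posterior mean of f(y)."""
    post = posterior_weights(atoms, prior, past)
    return sum(w * f[y] for f, w in zip(atoms, post))


# Hypothetical three-atom prior and a short data sequence.
atoms = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [0.1, 0.8, 0.1]]
prior = [0.5, 0.3, 0.2]
data = [0, 2, 1, 1, 0]

lhs = marginal(atoms, prior, data)
rhs = math.prod(
    predictive_at(atoms, prior, data[:i], data[i]) for i in range(len(data))
)
print(lhs, rhs)  # the two numbers agree up to floating-point error
```

Here the predictive densities are computed from posterior weights rather than as ratios of marginals, so the agreement is a genuine check of the identity and not a tautology.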

Observe that the assumption of Proposition 1 is implied by $\varepsilon_n$-thickness of $\Pi$ at
$f^*$. Also, the Kullback-Leibler divergence can be replaced by the Hellinger distance via the
well-known inequality $h \leq K$. That is,
$n^{-1} \sum_{i=1}^n E\{h(f^*, \hat f_{i-1})\} \lesssim \varepsilon_n^2$.

For another perspective, let $\bar f_n = n^{-1} \sum_{i=1}^n \hat f_{i-1}$, an average of predictive
densities. By convexity of $h$, $h(f^*, \bar f_n) \leq n^{-1} \sum_{i=1}^n h(f^*, \hat f_{i-1})$.
Therefore, Proposition 1 says that, if the prior is suitably concentrated around $f^*$, then
$H(f^*, \bar f_n) = O_P(\varepsilon_n)$. As Walker (2003) explains, the prior $\Pi$ would have to be
rather strange for this not to imply convergence of the predictive density $\hat f_n$ itself at
the same $\varepsilon_n$ rate.

It is interesting that the predictive densities, and averages thereof, are asymptotically
well-behaved with only local properties of the prior (Barron 1999). This is particularly
important because posterior convergence rates involve a compromise between local and
global properties. For example, overall posterior convergence rates are determined
by $\max\{\varepsilon_n, \varepsilon_n'\}$, where $\varepsilon_n'$ gives a global characterization of the
complexity of the model. In many cases, $\varepsilon_n'$ is bigger than $\varepsilon_n$, slowing down the
overall posterior convergence rate; see Ghosal and van der Vaart (2001). Proposition 1
requires no global conditions, so there is nothing slowing down convergence. So, although
Proposition 1 is a weaker result than full posterior convergence, it does provide some nice
intuition.



4 Convergence rates for the posterior 
4.1 Review of existing results 

There are essentially two kinds of theorems: the first kind makes assumptions on the 
"size" of the model F, and the second kind makes assumptions on how the prior prob- 
abilities are spread across F. Before proving the convergence rate theorem, we discuss 
these conditions in more detail. In particular, we show in Proposition 2 that the latter
assumption is stronger than the former. Throughout this discussion, we silently assume
that the prior $\Pi$ is $\varepsilon_n$-thick at $f^*$, with constant $C$ given in (4).



The first set of sufficient conditions is like those in Ghosal et al. (2000). Their
concern is the existence of a suitable high mass, low entropy sieve. Let $(\mathbb{F}_n)$ be an
increasing sequence of measurable subsets of $\mathbb{F}$. The idea is that the sieve $\mathbb{F}_n$
will be large enough to contain all the reasonable $f$'s, but also small enough to be covered
by a relatively small number of Hellinger balls that are each easier to work with. Let
$N(\varepsilon_n, \mathbb{F}_n, H)$ denote the Hellinger $\varepsilon_n$-covering number of $\mathbb{F}_n$,
that is, the minimum number of Hellinger balls of radius $\varepsilon_n$ needed to cover
$\mathbb{F}_n$. Theorem 2.1 of Ghosal et al. (2000) assumes that the following condition ("S" for
sieve) holds:

Condition S. There exists a sieve $\mathbb{F}_n \subset \mathbb{F}$ such that, for sufficiently large $n$,

(a) $\Pi(\mathbb{F}_n^c) \leq e^{-rn\varepsilon_n^2}$, where $r > C + 1$, and

(b) $\log N(\varepsilon_n, \mathbb{F}_n, H) \lesssim n\varepsilon_n^2$.

Part (a) ensures that $\Pi$ assigns most of its probability to a large subset of $\mathbb{F}$, and
Part (b) guarantees that this "large" subset of $\mathbb{F}$ is not too large. As opposed to prior
thickness, which is a local property, S(a) and S(b) are global properties. These conditions
have, along with prior thickness, been verified for a variety of important priors, including
Dirichlet process mixtures.

Despite the nice geometric intuition of Condition S, identifying a suitable sieve is
sometimes difficult in practice. Fortunately, there is an alternative sufficient condition
("P" for prior), due to Walker et al. (2007), which can be easier to work with.



Condition P. Let $B_n = \{f : H(f^*, f) > \varepsilon_n\}$. For $(A_{n,j})_{j \geq 1}$ a covering of
$B_n$ by Hellinger balls of radius $\delta_n < \varepsilon_n$, and some constants $c > 0$ and
$\beta > 1$, the following holds:

$$e^{-cn\varepsilon_n^2} \sum_{j \geq 1} \Pi(A_{n,j})^{1/\beta} \to 0.$$

The case $\beta = 2$ was considered in Walker et al. (2007). This condition ensures that
the prior is sufficiently concentrated near $f^*$. That is, if the prior is too spread out,
then those covering sets could get large enough posterior probability that the summation
above is of exponential order. An advantage of Condition P is that it is directly related to
the Bayesian problem, through the prior probabilities, and often these prior probabilities
have a nice form. And by allowing $\beta > 1$, the Condition P here is weaker than that in
Walker et al. (2007) with $\beta = 2$. However, in applications, the sets $A_{n,j}$ are typically
assigned exponentially small prior probability so it is not clear if $\beta \in (1, 2)$ is easier
to verify or leads to any improvement in the rate of convergence.

We claim that Condition S is, in a certain sense, more fundamental than Condition P,
despite the fact that the latter is often easier in practice. To justify this claim, we prove
that Condition S is actually weaker than Condition P. This connection between the two
sets of conditions is implicit in Theorem 5 of Ghosal and van der Vaart (2007b). An
analogous result is given in Choi and Ramamoorthi (2008, Theorem 4.4) in the context
of posterior consistency.

Proposition 2. Condition P implies Condition S. 

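Proof. Without loss of generality, suppose that, for each $n$, the sets $A_{n,j}$ are ordered
such that $\Pi(A_{n,1}) \geq \Pi(A_{n,2}) \geq \cdots$. Also, let
$S_n = \sum_{j \geq 1} \Pi(A_{n,j})^{1/\beta}$, which can be expressed as
$S_n = e^{cn\varepsilon_n^2 - \nu(n)}$ for some $\nu(n) > 0$ such that $\nu(n) \to \infty$. Take
$r > C + 1$, and set

$$\mathbb{F}_n = \bigcup_{j=1}^{J_n} A_{n,j}, \quad \text{where } J_n = \min\{j \in \mathbb{N} : j^{\beta-1} > S_n^\beta e^{rn\varepsilon_n^2}\}.$$

Clearly, $\log N(\varepsilon_n, \mathbb{F}_n, H) = \log J_n \leq \bigl(\tfrac{\beta c + r}{\beta - 1}\bigr) n\varepsilon_n^2 \lesssim n\varepsilon_n^2$,
so $\mathbb{F}_n$ satisfies Condition S(b).

Next, the special ordering of $\Pi(A_{n,j})$ implies that
$J\,\Pi(A_{n,J})^{1/\beta} \leq \sum_{j=1}^{J} \Pi(A_{n,j})^{1/\beta} \leq S_n$ for any $J$, which in turn
implies that $\Pi(A_{n,J}) \leq S_n^\beta / J^\beta$. Therefore,

$$\Pi(\mathbb{F}_n^c) = \Pi\Bigl(\bigcup_{j > J_n} A_{n,j}\Bigr) \leq \sum_{j > J_n} \Pi(A_{n,j}) \leq \sum_{j > J_n} \frac{S_n^\beta}{j^\beta} \lesssim e^{-rn\varepsilon_n^2},$$

so Condition S(a) holds as well. □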
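
The final bound in the proof of Proposition 2 rests on an integral comparison for the tail of the series $\sum_j j^{-\beta}$. The following worked step is a sketch of that comparison, assuming only $\beta > 1$ and the defining property $J_n^{\beta-1} > S_n^\beta e^{rn\varepsilon_n^2}$ of $J_n$; the constant $1/(\beta-1)$ is what the symbol $\lesssim$ absorbs.

```latex
\[
\sum_{j > J_n} \frac{S_n^\beta}{j^\beta}
  \;\le\; S_n^\beta \int_{J_n}^{\infty} x^{-\beta}\,dx
  \;=\; \frac{S_n^\beta}{(\beta - 1)\,J_n^{\beta - 1}}
  \;<\; \frac{e^{-rn\varepsilon_n^2}}{\beta - 1}.
\]
```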

Theorem 1. Suppose $\Pi$ is $\varepsilon_n$-thick at $f^*$. If either Condition S or Condition P
holds, then $\Pi_n(\{f : H(f^*, f) > \varepsilon_n\}) \to 0$ in probability.

Proof. The part involving Condition P follows from Proposition 2 and the part involving
Condition S. The part involving Condition S is proved in Section 4.3. □

Although the result in Theorem 1 is known, the proof that follows will highlight the
importance of predictive densities in the study of posterior convergence rates. The basic
idea is that sets $A_n$ in $\mathbb{F}$ such that the predictive densities, restricted to $A_n$, are not
too close to $f^*$ will have vanishing posterior probability. These predictive densities are
fundamental quantities in the Bayesian context, so the proof presented herein perhaps has
a better Bayesian interpretation compared to those arguments based on, say, existence
of a consistent sequence of tests.

4.2 A preliminary result

Recall that $\Pi_i$ denotes the posterior distribution of $f$, given $Y_1, \ldots, Y_i$. For a set
$A_n$, we write $\Pi_i^{A_n}$ for that same posterior distribution, but restricted and normalized
to $A_n$. Then we can define a corresponding predictive density:

$$\hat f_i^{A_n}(y) = \int_{A_n} f(y)\,\Pi_i^{A_n}(df), \quad i = 0, 1, \ldots, n.$$

For a sequence of sets $(A_n)$, let $L_{n,i} = \int_{A_n} R_i(f)\,\Pi(df)$ be the numerator of
$\Pi_i(A_n)$ in (2), $i = 1, \ldots, n$. Note that $L_{n,0} = \Pi(A_n)$. It is easy to check that

$$L_{n,i}/L_{n,i-1} = \hat f_{i-1}^{A_n}(Y_i)/f^*(Y_i), \quad i = 1, \ldots, n.$$

For $\mathscr{A}_{i-1}$ the $\sigma$-algebra generated by $Y_1, \ldots, Y_{i-1}$, it follows that

$$E\{(L_{n,i}/L_{n,i-1})^{1/2} \mid \mathscr{A}_{i-1}\} = 1 - h(f^*, \hat f_{i-1}^{A_n}). \tag{6}$$
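
Both identities above, the ratio formula for $L_{n,i}/L_{n,i-1}$ and the conditional expectation in (6), can be checked exactly in a toy model. The sketch below assumes a finite sample space and a finitely supported prior, with the set $A_n$ encoded as a boolean mask over the prior atoms; all names and numbers are hypothetical.

```python
import math


def h(f, g):
    """h = H^2 / 2 for probability vectors on a finite sample space."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(f, g)) / 2


def numerator(atoms, prior, in_set, f_star, data):
    """Integral over the set of the likelihood ratio R_i(f) against the prior."""
    return sum(
        w * math.prod(f[y] / f_star[y] for y in data)
        for f, w, keep in zip(atoms, prior, in_set)
        if keep
    )


def restricted_predictive(atoms, prior, in_set, data):
    """Predictive density under the posterior restricted and normalized to the set."""
    unnorm = [
        w * math.prod(f[y] for y in data) if keep else 0.0
        for f, w, keep in zip(atoms, prior, in_set)
    ]
    total = sum(unnorm)
    return [
        sum(u * f[y] for f, u in zip(atoms, unnorm)) / total
        for y in range(len(atoms[0]))
    ]


# Hypothetical setup: the set contains the second and third prior atoms.
f_star = [0.5, 0.3, 0.2]
atoms = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [0.1, 0.8, 0.1]]
prior = [0.5, 0.3, 0.2]
in_set = [False, True, True]
data = [0, 2, 1]

checks = []
for i in range(1, len(data) + 1):
    past, y = data[: i - 1], data[i - 1]
    pred = restricted_predictive(atoms, prior, in_set, past)
    ratio = numerator(atoms, prior, in_set, f_star, data[:i]) / numerator(
        atoms, prior, in_set, f_star, past
    )
    # Conditional expectation of the square-root ratio, taken under f*.
    cond_exp = sum(p * math.sqrt(q / p) for p, q in zip(f_star, pred))
    checks.append((ratio, pred[y] / f_star[y], cond_exp, 1 - h(f_star, pred)))

for row in checks:
    print(row)  # first two entries agree, and last two entries agree
```

The second identity is just the statement that the Hellinger affinity $\int (f^* \hat f)^{1/2}\,d\mu$ equals $1 - h$.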



The next result, akin to Proposition 1 in Walker et al. (2007), provides a convenient
fixed-$n$ bound on the expected value of $L_{n,n}^{1/2}$ for suitable sets $A_n$.

Lemma 3. Let $\Pi$ be $\varepsilon_n$-thick at $f^*$, with $C$ as in (4). If $A_n$ is convex and
$d\varepsilon_n^2$-separated from $f^*$, with $d > C + 1$, then
$E(L_{n,n}^{1/2}) \leq \Pi(A_n)^{1/2} e^{-dn\varepsilon_n^2}$.



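Proof. Start with the "telescoping product"

$$\frac{L_{n,n}^{1/2}}{\Pi(A_n)^{1/2}} = \prod_{i=1}^n \Bigl(\frac{L_{n,i}}{L_{n,i-1}}\Bigr)^{1/2}.$$

Taking expectation of both sides, conditioning on $\mathscr{A}_{n-1}$, and using (6), gives

$$\frac{E(L_{n,n}^{1/2})}{\Pi(A_n)^{1/2}} = E\Bigl[\{1 - h(f^*, \hat f_{n-1}^{A_n})\} \prod_{i=1}^{n-1} \Bigl(\frac{L_{n,i}}{L_{n,i-1}}\Bigr)^{1/2}\Bigr].$$

The assumed convexity of $A_n$ and its separation from $f^*$ together imply that
$h(f^*, \hat f_{n-1}^{A_n}) > d\varepsilon_n^2$, so the first factor is at most
$1 - d\varepsilon_n^2 \leq e^{-d\varepsilon_n^2}$. Repeating the argument for $i = n-1, \ldots, 1$ gives
$E(L_{n,n}^{1/2})/\Pi(A_n)^{1/2} \leq e^{-dn\varepsilon_n^2}$. Multiplying both sides by
$\Pi(A_n)^{1/2}$ completes the proof. □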



The thickness assumption in Lemma 3 is not necessary, but it helps to set the notation
for its primary application. This use of ratios of predictive densities is not new; see Walker
(2004) and Walker et al. (2007). In fact, Lemma 3 is similar to the main conclusion in
Proposition 1 of Walker et al. (2007), although the proof is a bit different.






4.3 Proof of Theorem 1

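For $M$ a sufficiently large constant to be determined, define
$B_n = \{f : H(f^*, f) > M\varepsilon_n\}$. For the given $\mathbb{F}_n$, it is clear that
$\Pi_n(B_n) \leq \Pi_n(\mathbb{F}_n^c) + \Pi_n(B_n \cap \mathbb{F}_n)$. From Condition S(a)
and Lemma 2 we conclude that $\Pi_n(\mathbb{F}_n^c) \to 0$ in probability. We now turn attention to
the second term, $\Pi_n(B_n \cap \mathbb{F}_n)$.

Choose a covering $B_n \cap \mathbb{F}_n \subset \bigcup_{j=1}^{J_n} A_{n,j}$ where each $A_{n,j}$ is a
Hellinger ball of radius $M\varepsilon_n/2$ with center in $B_n$. By Condition S(b),
$J_n \leq e^{Rn\varepsilon_n^2}$ for some $R > 0$. Now, since probabilities are $\leq 1$, we have

$$\Pi_n(B_n \cap \mathbb{F}_n) \leq \sum_{j=1}^{J_n} \Pi_n(A_{n,j}) \leq \sum_{j=1}^{J_n} \Pi_n(A_{n,j})^{1/2} = \frac{1}{I_n^{1/2}} \sum_{j=1}^{J_n} L_{n,n,j}^{1/2},$$

where $I_n = \int R_n(f)\,\Pi(df)$ is the denominator of $\Pi_n(A_{n,j})$, which is independent of
$j$, and $L_{n,n,j} = \int_{A_{n,j}} R_n(f)\,\Pi(df)$ is the numerator. From the triangle inequality
argument in the example following Definition 3, we know that $f^*$ and $A_{n,j}$ are
$(M^2\varepsilon_n^2/8)$-separated for all $j = 1, \ldots, J_n$. So, provided that $M$ is sufficiently
large, we may apply Lemma 3 to bound the expectation of the sum of $L_{n,n,j}^{1/2}$. Indeed,

$$E\Bigl(\sum_{j=1}^{J_n} L_{n,n,j}^{1/2}\Bigr) \leq \sum_{j=1}^{J_n} \Pi(A_{n,j})^{1/2} e^{-(M^2/8)n\varepsilon_n^2} \leq e^{-(M^2/8 - R)n\varepsilon_n^2}.$$

If we choose $M$ such that $M^2 > 4[(C + 1) + 2R]$, then (application of Lemma 3 is valid
and) the upper bound above vanishes as $n \to \infty$. Now let
$S_n = \sum_{j=1}^{J_n} L_{n,n,j}^{1/2}$ and pick $c \in (C + 1, M^2/4 - 2R)$. By Markov's inequality
we have

$$P(e^{cn\varepsilon_n^2/2} S_n > \eta) \lesssim e^{-(M^2/4 - 2R - c)n\varepsilon_n^2/2} \to 0, \quad \forall\, \eta > 0.$$

Also, by Lemma 1, we have $P(I_n \leq e^{-cn\varepsilon_n^2}) \to 0$. A total-probability argument like
in the proof of Lemma 2 gives

$$\begin{aligned}
P\{\Pi_n(B_n \cap \mathbb{F}_n) > \eta\} &\leq P(S_n/I_n^{1/2} > \eta) \\
&= P(S_n/I_n^{1/2} > \eta,\, I_n \leq e^{-cn\varepsilon_n^2}) + P(S_n/I_n^{1/2} > \eta,\, I_n > e^{-cn\varepsilon_n^2}) \\
&\leq P(I_n \leq e^{-cn\varepsilon_n^2}) + P(S_n/I_n^{1/2} > \eta,\, I_n > e^{-cn\varepsilon_n^2}) \\
&\leq P(I_n \leq e^{-cn\varepsilon_n^2}) + P(e^{cn\varepsilon_n^2/2} S_n > \eta).
\end{aligned}$$

Since both of these terms vanish, we conclude that
$\Pi_n(B_n) \leq \Pi_n(\mathbb{F}_n^c) + \Pi_n(B_n \cap \mathbb{F}_n) \to 0$ in probability, i.e.,
$\Pi_n(\{f : H(f^*, f) > M\varepsilon_n\}) \to 0$ in probability.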

5 Extension to mis-specified iid models 
5.1 Notation and setup 

It can happen that the true density $f^*$ lies outside the support of the prior. In such
cases, the posterior cannot concentrate around $f^*$. However, the posterior can exhibit
concentration properties around a different point in $\mathbb{F}$. Specifically, take $f^\circ$ to be
the $f \in \mathbb{F}$ that minimizes the Kullback-Leibler divergence, i.e.,

$$K^*(f^\circ, f) := K(f^*, f) - K(f^*, f^\circ) \geq 0, \quad \forall\, f \in \mathbb{F}. \tag{7}$$

An analysis of posterior concentration is presented in Kleijn and van der Vaart (2006).
They show that, under certain conditions, the posterior distribution concentrates around
the point $f^\circ$. Indeed, if

$$R_n(f) = \prod_{i=1}^n f(Y_i)/f^\circ(Y_i),$$

then the posterior is given by

$$\Pi_n(A) = \frac{\int_A R_n(f)\,\Pi(df)}{\int_{\mathbb{F}} R_n(f)\,\Pi(df)}. \tag{8}$$

Then the goal is to show that $\Pi_n(B_n^c) \to 0$, where $B_n$ is a shrinking neighborhood of
$f^\circ$. Here we give an analysis based primarily on predictive densities. First, we
recall/revise some of our previous notions.

Prior thickness. Let $V^*(f^\circ, f) = \int \{\log(f^\circ/f)\}^2 f^* \, d\mu$. Then we have the
following analogue of Definition 1, i.e., the prior $\Pi$ is $\varepsilon_n$-thick at $f^\circ$ if, for
some constant $C > 0$,

$$\Pi(\{f : K^*(f^\circ, f) \leq \varepsilon_n^2,\; V^*(f^\circ, f) \leq \varepsilon_n^2\}) \geq e^{-Cn\varepsilon_n^2}. \tag{9}$$

It follows from Lemma 7.1 of Kleijn and van der Vaart (2006) that the result of
Lemma 1 above holds in the mis-specified case, i.e.,

$$P(I_n \leq e^{-cn\varepsilon_n^2}) \to 0 \quad \text{for any } c > C + 1, \tag{10}$$

where $I_n = \int R_n(f)\,\Pi(df)$ is the denominator in (8).

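Separation. For a distance on $\mathbb{F}$, consider a weighted Hellinger distance $H^*$, whose
square is given by $H^*(f, f')^2 = \int (f^{1/2} - f'^{1/2})^2 (f^*/f^\circ) \, d\mu$. In the
well-specified case, i.e., $f^\circ = f^*$, this is the usual Hellinger distance. Since
$\int (f/f^\circ) f^* \, d\mu \leq 1$ for all $f \in \mathbb{F}$ (Kleijn and van der Vaart 2006,
Lemma 2.3), we have

$$H^*(f^\circ, f)^2 = \int (f^{\circ 1/2} - f^{1/2})^2 \frac{f^*}{f^\circ} \, d\mu \leq 2 - 2 \int \Bigl(\frac{f}{f^\circ}\Bigr)^{1/2} f^* \, d\mu.$$

Write $h^*(f^\circ, f) = 1 - \int (f/f^\circ)^{1/2} f^* \, d\mu$, so that $H^{*2}/2 \leq h^*$. Now we say
that $f^\circ$ and a set $A$ are $\delta$-separated (with respect to $h^*$) if
$h^*(f^\circ, f) > \delta$ for all $f \in A$.
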
5.2 Convergence rate results

First, we extend Proposition 1 to the mis-specified case. The only noticeable change is
the use of the Kullback-Leibler contrast $K^*(f^\circ, f)$ in (7) instead of $K(f^*, f)$.

Proposition 3. For a sequence $(\varepsilon_n)$, with $\varepsilon_n \to 0$, let
$K_n = \{f : K^*(f^\circ, f) \leq \varepsilon_n^2\}$. If $\log \Pi(K_n) \gtrsim -n\varepsilon_n^2$, then
$n^{-1} \sum_{i=1}^n E\{K^*(f^\circ, \hat f_{i-1})\} \lesssim \varepsilon_n^2$.

Proof. Similar to that of Proposition 1; use convexity of $K^*$. □

As before, if $\bar f_n$ is the average predictive density,
$\bar f_n = n^{-1} \sum_{i=1}^n \hat f_{i-1}$, then Proposition 3 and convexity of the
Kullback-Leibler contrast imply $K^*(f^\circ, \bar f_n) = O_P(\varepsilon_n^2)$. Also, the condition on
$\Pi(K_n)$ is implied by prior thickness (9). Therefore, just like in the well-specified case,
with only a local thickness condition on the prior, the predictive densities, or averages
thereof, converge to the "best" density $f^\circ$ in the model $\mathbb{F}$.

Towards a posterior concentration result, given sets $(A_n)$ in $\mathbb{F}$, let $L_{n,i}$ be the
numerator of $\Pi_i(A_n)$ in (8); note, $L_{n,0} = \Pi(A_n)$. Then, as before, it is easy to check
that $L_{n,i}/L_{n,i-1} = \hat f_{i-1}^{A_n}(Y_i)/f^\circ(Y_i)$, $i = 1, \ldots, n$. Also

$$E\{(L_{n,i}/L_{n,i-1})^{1/2} \mid \mathscr{A}_{i-1}\} = 1 - h^*(f^\circ, \hat f_{i-1}^{A_n}), \quad i = 1, \ldots, n.$$

We can now anticipate a version of Lemma 3 for the mis-specified case.

Lemma 4. Let $\Pi$ be $\varepsilon_n$-thick at $f^\circ$, with $C$ as in (9). If $A_n$ is convex and
$d\varepsilon_n^2$-separated from $f^\circ$, with $d > C + 1$, then
$E(L_{n,n}^{1/2}) \leq \Pi(A_n)^{1/2} e^{-dn\varepsilon_n^2}$.

Proof. Same as that of Lemma 3. □

To get a posterior convergence rate result, we must choose sets to be convex and
suitably separated, with respect to $h^*$, from $f^\circ$. For this, a natural choice would be
$H^*$-balls. Indeed, the triangle inequality argument before, and the fact that
$H^{*2}/2 \leq h^*$, show that $H^*$-balls centered away from $f^\circ$ with sufficiently small radius
are separated from $f^\circ$. Technically, a more complicated notion of "covering numbers for
testing under mis-specification" is needed in these cases. However, if we assume
$\mathbb{F}$ is convex, for simplicity, then these special covering numbers are bounded by
ordinary $H^*$-covering numbers. See Kleijn and van der Vaart (2006), Lemmas 2.1, 2.3, and
the mixture model example in their Section 3.

Theorem 2. Let $\mathbb{F}$ be convex and $\Pi$ be $\varepsilon_n$-thick at $f^\circ$ with constant $C$ as
in (9). Suppose there exists a sequence $(\mathbb{F}_n)$ such that
$\Pi(\mathbb{F}_n^c) \leq e^{-rn\varepsilon_n^2}$, where $r > C + 1$, and
$\log N(\varepsilon_n, \mathbb{F}_n, H^*) \lesssim n\varepsilon_n^2$. Then
$\Pi_n(\{f : H^*(f^\circ, f) > \varepsilon_n\}) \to 0$ in probability.

Proof. Similar to that of Theorem 1. □



6 Extension to independent non-iid models 
6.1 Setup and notation 

Let $Y_1, \ldots, Y_n$ be independent but not necessarily iid. To formulate this, we shall use
some slightly different notation compared to the previous sections. Suppose that
$Y_i \sim f_{\theta i}$, where, for each $\theta \in \Theta$, $f_{\theta i}$ is a density with respect to a
measure $\mu_i$ on $\mathbb{Y}_i$, $i = 1, \ldots, n$. An important example is the fixed-design Gaussian
regression, i.e., $Y_i \sim N(\theta(x_i), 1)$, where $x_i$ is a fixed covariate and $\theta(\cdot)$ is an
unknown regression function. The new $\theta$ notation is simply to indicate that there is a
single unknown characteristic $\theta$, common to all $i = 1, \ldots, n$; the manner in which
$\theta$ is used can differ across $i$, however.

Let $\theta^*$ denote the "true" value of $\theta$. As before, define the likelihood ratio as

$$R_n(\theta) = \prod_{i=1}^n f_{\theta i}(Y_i)/f_{\theta^* i}(Y_i).$$

If $\Pi$ is a prior distribution on $\Theta$, then the posterior distribution for $\theta$, given
observations $Y_1, \ldots, Y_n$, is given by

$$\Pi_n(B) = \Pi(B \mid Y_1, \ldots, Y_n) = \frac{\int_B R_n(\theta)\,\Pi(d\theta)}{\int_\Theta R_n(\theta)\,\Pi(d\theta)}. \tag{11}$$

The goal is to show that $\Pi_n(B_n^c) \to 0$ for $B_n$ a shrinking neighborhood of $\theta^*$. Next
we restate our main definitions.

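Prior thickness. Let

$$K_n(\theta^*, \theta) = \frac{1}{n} \sum_{i=1}^n K(f_{\theta^* i}, f_{\theta i}) \quad \text{and} \quad V_n(\theta^*, \theta) = \frac{1}{n} \sum_{i=1}^n V(f_{\theta^* i}, f_{\theta i}),$$

where $K$ and $V$ are defined in Section 2.2. Then we say the prior $\Pi$ is $\varepsilon_n$-thick at
$\theta^*$ if for some constant $C > 0$,

$$\Pi(\{\theta : K_n(\theta^*, \theta) \leq \varepsilon_n^2,\; V_n(\theta^*, \theta) \leq \varepsilon_n^2\}) \geq e^{-Cn\varepsilon_n^2}. \tag{12}$$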
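
For the fixed-design Gaussian regression example mentioned in the setup, the averaged quantities $K_n$ and $V_n$ have closed forms, because for unit-variance normals $K(N(a,1), N(b,1)) = (a-b)^2/2$ and $V(N(a,1), N(b,1)) = (a-b)^2 + (a-b)^4/4$. The sketch below evaluates them on a hypothetical design; the regression functions, the design points, and the helper names are illustrative assumptions rather than anything taken from the paper.

```python
def kl_normal(a, b):
    """K between N(a, 1) and N(b, 1): (a - b)^2 / 2."""
    return (a - b) ** 2 / 2.0


def v_normal(a, b):
    """V between N(a, 1) and N(b, 1): second moment of the log density
    ratio under N(a, 1), equal to (a - b)^2 + (a - b)^4 / 4."""
    d2 = (a - b) ** 2
    return d2 + d2 ** 2 / 4.0


def mean_kl(theta_star, theta, xs):
    """Average of the per-observation Kullback-Leibler divergences."""
    return sum(kl_normal(theta_star(x), theta(x)) for x in xs) / len(xs)


def mean_v(theta_star, theta, xs):
    """Average of the per-observation V terms."""
    return sum(v_normal(theta_star(x), theta(x)) for x in xs) / len(xs)


def in_thickness_set(theta_star, theta, xs, eps):
    """True if theta lies in the neighborhood appearing in the thickness condition."""
    return (mean_kl(theta_star, theta, xs) <= eps ** 2
            and mean_v(theta_star, theta, xs) <= eps ** 2)


# Hypothetical true regression function, a shifted competitor, and a design.
def theta_star(x):
    return 1.0 + 2.0 * x


def theta_shifted(x):
    return 1.0 + 2.0 * x + 0.1


xs = [0.0, 0.25, 0.5, 0.75, 1.0]

print(mean_kl(theta_star, theta_shifted, xs))   # about 0.005
print(mean_v(theta_star, theta_shifted, xs))    # about 0.010025
print(in_thickness_set(theta_star, theta_shifted, xs, 0.11))
print(in_thickness_set(theta_star, theta_shifted, xs, 0.10))
```

In this example the $V_n$ constraint is the binding one: the shifted function sits inside the neighborhood for $\varepsilon = 0.11$ but just outside it for $\varepsilon = 0.10$.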



It follows from Lemma 10 of Ghosal and van der Vaart (2007a) that the conclusion
of Lemma 1 above holds in the independent non-iid case. That is,

$$P(I_n \leq e^{-cn\varepsilon_n^2}) \to 0 \quad \text{for any } c > C + 1, \tag{13}$$

where $I_n = \int R_n(\theta)\,\Pi(d\theta)$ is the denominator in (11).

Separation. For a distance on $\Theta$, we shall employ a type of mean-Hellinger distance $H_n$,
whose square is given by

$$H_n(\theta^*, \theta)^2 = \frac{1}{n} \sum_{i=1}^n H(f_{\theta^* i}, f_{\theta i})^2.$$

As usual, set $h_n = H_n^2/2$. Then we say that $\theta^*$ and a set $A \subset \Theta$ are
$\delta$-separated (with respect to $h_n$) if $h_n(\theta^*, \theta) > \delta$ for all $\theta \in A$.

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6.2 Convergence rate results 

Before stating the independent non-iid version of Proposition 1, we need to set some more
notation. Let $\hat f_{(i-1)i}$ denote the predictive density of $Y_i$ given
$Y_1, \ldots, Y_{i-1}$, i.e.,

\[ \hat f_{(i-1)i}(y) = \int f_{\theta i}(y) \, \Pi_{i-1}(d\theta), \quad i = 1, \ldots, n. \]
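To make the definition concrete, here is a minimal sketch of the sequential predictive densities for the fixed-design Gaussian regression example, assuming (purely for illustration) a prior supported on three hypothetical candidate regression functions. The function name `sequential_predictives`, the candidate slopes, and the prior weights are not from the paper. The last line computes the joint Bayes marginal density directly, so that it can be compared with the product of the predictive densities.

```python
import numpy as np

def normal_pdf(y, mean):
    """Density of N(mean, 1) at y (elementwise)."""
    return np.exp(-0.5 * (y - mean) ** 2) / np.sqrt(2.0 * np.pi)

def sequential_predictives(y, candidate_means, prior):
    """Predictive density of each Y_i at its observed value, updating the posterior as we go."""
    weights = np.asarray(prior, dtype=float).copy()   # Pi_0 is the prior
    preds = np.empty(len(y))
    for i, y_i in enumerate(y):
        lik = normal_pdf(y_i, candidate_means[:, i])  # f_{theta i}(Y_i) for each candidate
        preds[i] = np.dot(weights, lik)               # integral of f_{theta i} against Pi_{i-1}
        weights = weights * lik / preds[i]            # Bayes update from Pi_{i-1} to Pi_i
    return preds, weights

rng = np.random.default_rng(2)
n = 30
x = np.linspace(0.0, 1.0, n)
slopes = np.array([0.0, 1.0, 2.0])                    # hypothetical candidates
candidate_means = slopes[:, None] * x[None, :]        # shape (3, n)
prior = np.array([0.2, 0.5, 0.3])
y = 1.0 * x + rng.standard_normal(n)

preds, final_weights = sequential_predictives(y, candidate_means, prior)
joint_marginal = np.dot(prior, np.prod(normal_pdf(y[None, :], candidate_means), axis=1))
print(np.prod(preds), joint_marginal)
```

Up to floating-point error, `np.prod(preds)` agrees with `joint_marginal`, which is the chain-rule factorization of the joint density under the Bayes model.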

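Proposition 4. For a sequence $(\varepsilon_n)$, with $\varepsilon_n \to 0$, let
$\mathcal{K}_n = \{\theta : K_n(\theta^*, \theta) \leq \varepsilon_n^2\}$. If
$\log \Pi(\mathcal{K}_n) \geq -n \varepsilon_n^2$, then
$n^{-1} \sum_{i=1}^n E\{K(f_{\theta^* i}, \hat f_{(i-1)i})\} \lesssim \varepsilon_n^2$.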

Proof. The proof is similar to that of Proposition 1 once we set the appropriate notation,
etc. Let $\hat f_n$ denote the joint density for $(Y_1, \ldots, Y_n)$ under the Bayes model.
Then, just like in the proof of Proposition 1,

\[ \hat f_n(Y_1, \ldots, Y_n) = \int \prod_{i=1}^n f_{\theta i}(Y_i) \, \Pi(d\theta)
   = \prod_{i=1}^n \hat f_{(i-1)i}(Y_i). \]

It follows that $K(f^n_{\theta^*}, \hat f_n) = \sum_{i=1}^n E\{K(f_{\theta^* i}, \hat f_{(i-1)i})\}$,
where $f^n_{\theta^*}$ is the joint density of $(Y_1, \ldots, Y_n)$ under $\theta^*$. Therefore,
we can safely work with the notationally simpler $n^{-1} K(f^n_{\theta^*}, \hat f_n)$. From this
point, follow the proof of Proposition 1, i.e., restrict to $\mathcal{K}_n$ and use convexity of
the Kullback-Leibler number. □
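The two steps named at the end of the proof can be written out as follows. This is a sketch, under the assumption that the proof of Proposition 1 follows the usual restrict-and-convexity route. Here $\mathcal{K}_n$ is the Kullback-Leibler neighborhood from Proposition 4, $\Pi^{\mathcal{K}_n}$ denotes the prior restricted and renormalized to it, and $f^n_\theta = \prod_i f_{\theta i}$ is the joint density under $\theta$; the last two symbols are notation introduced here for illustration.

```latex
% Step 1 (restrict): \hat f_n \ge \Pi(\mathcal{K}_n) \int f^n_\theta \, \Pi^{\mathcal{K}_n}(d\theta)
K(f^n_{\theta^*}, \hat f_n)
  \le -\log \Pi(\mathcal{K}_n)
      + K\Bigl(f^n_{\theta^*}, \int f^n_\theta \, \Pi^{\mathcal{K}_n}(d\theta)\Bigr)
% Step 2 (convexity of K in its second argument, then independence)
  \le -\log \Pi(\mathcal{K}_n)
      + \int K(f^n_{\theta^*}, f^n_\theta) \, \Pi^{\mathcal{K}_n}(d\theta)
  = -\log \Pi(\mathcal{K}_n)
      + n \int K_n(\theta^*, \theta) \, \Pi^{\mathcal{K}_n}(d\theta)
  \le n \varepsilon_n^2 + n \varepsilon_n^2 .
```

Dividing by $n$ gives a bound of $2\varepsilon_n^2$ on the average expected Kullback-Leibler divergence of the predictive densities.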

For posterior convergence rates, take a sequence of subsets $(A_n)$ in $\Theta$ and let
$L_{n,i}$ be the numerator of $\Pi_i(A_n)$ in (11), where $L_{n,0} = \Pi(A_n)$. As before, we have

\[ L_{n,i} / L_{n,i-1} = \hat f^{A_n}_{(i-1)i}(Y_i) / f_{\theta^* i}(Y_i),
   \quad i = 1, \ldots, n, \quad n \geq 1, \]

where $\hat f^{A_n}_{(i-1)i}$ is the predictive density from before, but with the posterior
restricted to the set $A_n \subset \Theta$. Also,

\[ E\{(L_{n,i} / L_{n,i-1})^{1/2} \mid Y_1, \ldots, Y_{i-1}\}
   = 1 - h(f_{\theta^* i}, \hat f^{A_n}_{(i-1)i}), \]

where $h = H^2/2$ and $H$ is the usual Hellinger distance between densities. With this, we
are ready for an analogue of Lemma 3 for the independent non-iid case.
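The conditional expectation identity above is the Hellinger affinity in disguise: under the true density $f$, $E\{(g/f)^{1/2}\} = \int \sqrt{fg} = 1 - H(f, g)^2/2$. The sketch below checks this numerically for one hypothetical pair of densities, a standard Gaussian standing in for the true density and a two-component Gaussian mixture standing in for a predictive density; the specific densities and the quadrature grid are assumptions made for illustration only.

```python
import numpy as np

def normal_pdf(y, mean, sd=1.0):
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def integrate(values, grid):
    """Trapezoidal rule on the given grid."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

grid = np.linspace(-15.0, 15.0, 60001)
f_true = normal_pdf(grid, 0.0)                                       # stands in for the true density
f_pred = 0.3 * normal_pdf(grid, 0.5) + 0.7 * normal_pdf(grid, 1.5)   # a mixture "predictive"

# E{(f_pred / f_true)^{1/2}} under f_true equals the integral of sqrt(f_true * f_pred)
affinity = integrate(np.sqrt(f_true * f_pred), grid)
hellinger_sq = integrate((np.sqrt(f_true) - np.sqrt(f_pred)) ** 2, grid)
h = hellinger_sq / 2.0
print(affinity, 1.0 - h)
```

The two printed numbers agree up to quadrature and rounding error, which is the identity used to pass from ratios of the $L_{n,i}$ to the distance $h$.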

Lemma 5. Let $\Pi$ be $\varepsilon_n$-thick at $\theta^*$, with $C$ as in (12). If $A_n$ is
convex and $d\varepsilon_n^2$-separated from $\theta^*$, with $d > C + 1$, then
$E(L_{n,n}^{1/2}) \leq \Pi(A_n)^{1/2} e^{-d n \varepsilon_n^2}$.

Proof. Just like the proof of Lemma 3. □
In the following theorem, we shall also need a type of max-Hellinger distance,

\[ H_{n,\infty}(\theta^*, \theta) = \max_{1 \leq i \leq n} H(f_{\theta^* i}, f_{\theta i}). \]

Also let $h_{n,\infty} = H_{n,\infty}^2/2$. This additional sort of distance will be needed for
the general construction of sets which are both convex and sufficiently separated from
$\theta^*$. Some remarks on removing the need for $H_{n,\infty}$ are given following the proof.
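For the fixed-design Gaussian regression example both distances are available in closed form, since the squared Hellinger distance between $N(a, 1)$ and $N(b, 1)$ is $2\{1 - e^{-(a-b)^2/8}\}$. The following sketch computes the mean- and max-Hellinger distances for two hypothetical regression functions; the functions, the design, and the helper names are illustrative assumptions.

```python
import numpy as np

def hellinger_sq_gaussian(mean_a, mean_b):
    """Squared Hellinger distance between N(mean_a, 1) and N(mean_b, 1), elementwise."""
    return 2.0 * (1.0 - np.exp(-((mean_a - mean_b) ** 2) / 8.0))

def mean_hellinger(mean_a, mean_b):
    """H_n: square root of the average squared Hellinger distance over the design."""
    return float(np.sqrt(np.mean(hellinger_sq_gaussian(mean_a, mean_b))))

def max_hellinger(mean_a, mean_b):
    """H_{n,infinity}: largest Hellinger distance over the design."""
    return float(np.sqrt(np.max(hellinger_sq_gaussian(mean_a, mean_b))))

x = np.linspace(0.0, 1.0, 50)
theta_star = np.sin(2.0 * np.pi * x)       # hypothetical true regression function
theta = 0.8 * np.sin(2.0 * np.pi * x)      # hypothetical alternative
H_n = mean_hellinger(theta_star, theta)
H_n_inf = max_hellinger(theta_star, theta)
print(H_n, H_n_inf)
```

An average never exceeds a maximum, so $H_n \leq H_{n,\infty}$ always; balls in the max-Hellinger distance are therefore contained in mean-Hellinger balls of the same radius.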

Theorem 3. Let $\Pi$ be $\varepsilon_n$-thick at $\theta^*$ with constant $C$ as in (12).
Suppose there exists a sequence $(\Theta_n)$ such that
$\Pi(\Theta_n^c) \leq e^{-r n \varepsilon_n^2}$, where $r > C + 1$, and
$\log N(\varepsilon_n, \Theta_n, H_{n,\infty}) \leq n \varepsilon_n^2$.
Then $\Pi_n(\{\theta : H_n(\theta^*, \theta) > \varepsilon_n\}) \to 0$ in probability.

Proof. For a constant $M > 0$ to be determined, let
$B_n = \{\theta : H_n(\theta^*, \theta) > M\varepsilon_n\}$. It suffices to show that
$\Pi_n(B_n \cap \Theta_n) \to 0$ in probability. We cover $B_n \cap \Theta_n$ by
$H_{n,\infty}$-balls $A_{nj}$ of radius $M\varepsilon_n/2$ with centers in $B_n$, where
$j = 1, \ldots, J_n$ and $J_n \leq e^{R n \varepsilon_n^2}$, $R > 0$. That is, for suitable
$\theta_j$ satisfying $H_n(\theta^*, \theta_j) > M\varepsilon_n$, take

\[ A_{nj} = \{\theta : H_{n,\infty}(\theta_j, \theta) < M\varepsilon_n/2\},
   \quad j = 1, \ldots, J_n. \]

Note the use of $H_{n,\infty}$ in the definition of $A_{nj}$ as opposed to $H_n$. Everything
will carry through as before as soon as we show that the $A_{nj}$ are convex and
$(M^2\varepsilon_n^2/8)$-separated from $\theta^*$, with respect to $h_n$. For convexity, let
$\Phi_i$, $i = 1, \ldots, n$, be any probability measures on $A_{nj}$. By convexity of $h$ we get

\[ h\Bigl(f_{\theta_j i}, \int f_{\theta i} \, \Phi_i(d\theta)\Bigr)
   \leq \int h(f_{\theta_j i}, f_{\theta i}) \, \Phi_i(d\theta), \quad i = 1, \ldots, n. \]

By definition of $A_{nj}$, the right-hand side is bounded by $M^2\varepsilon_n^2/8$. From here
it follows that $A_{nj}$ is convex and, in particular, predictive densities
$\hat f^{A_{nj}}_{(i-1)i}$, restricted to $A_{nj}$, have properties like those densities
$f_{\theta i}$ with $\theta \in A_{nj}$. Our use of the max-Hellinger metric is necessary here
because the measures $\Phi_i$ can vary with $i$, just like the posteriors restricted to
$A_{nj}$ vary with $i$. Now, for separation, given $\theta \in A_{nj}$, the triangle inequality
for $H_n$ gives

\[ H_n(\theta^*, \theta) \geq H_n(\theta^*, \theta_j) - H_n(\theta_j, \theta). \]

The first term is greater than $M\varepsilon_n$ by the choice of $\theta_j$. The second term is
at most $H_{n,\infty}(\theta_j, \theta)$, which is less than $M\varepsilon_n/2$ by the definition
of $A_{nj}$. Therefore, $\theta^*$ and $A_{nj}$ are $(M^2\varepsilon_n^2/8)$-separated with
respect to $h_n$. Now we may apply Lemma 5 to each $A_{nj}$, just like in the proof of
Theorem 1, to show that if $M$ is large enough, then
$\Pi_n(\{\theta : H_n(\theta^*, \theta) > M\varepsilon_n\}) \to 0$ in probability. □
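The last step is only referenced to the proof of Theorem 1, which is not reproduced in this section. As a rough sketch of how such an argument typically runs (an assumption about the details, not a quotation of the authors' proof), write $L_{n,n}^{(j)}$ for the numerator of $\Pi_n(A_{nj})$, fix $c \in (C+1, r)$, and work on the event $E_n = \{I_n \geq e^{-c n \varepsilon_n^2}\}$, whose probability tends to one by (13). The bookkeeping below also assumes $n\varepsilon_n^2 \to \infty$ and $M^2/8 > C + 1$, so that Lemma 5 applies with $d = M^2/8$.

```latex
% On E_n, using that the A_{nj} cover B_n \cap \Theta_n and that \Pi_n \le \Pi_n^{1/2}:
\Pi_n(B_n \cap \Theta_n)
  \le \sum_{j=1}^{J_n} \Pi_n(A_{nj})^{1/2}
  = \sum_{j=1}^{J_n} \bigl( L_{n,n}^{(j)} / I_n \bigr)^{1/2}
  \le e^{c n \varepsilon_n^2 / 2} \sum_{j=1}^{J_n} \bigl( L_{n,n}^{(j)} \bigr)^{1/2}.
% Taking expectations, applying Lemma 5 with d = M^2/8, and using \Pi(A_{nj}) \le 1:
E\bigl\{ \Pi_n(B_n \cap \Theta_n) \, 1_{E_n} \bigr\}
  \le J_n \, e^{c n \varepsilon_n^2 / 2} \, e^{-(M^2/8) n \varepsilon_n^2}
  \le \exp\bigl\{ (R + c/2 - M^2/8) \, n \varepsilon_n^2 \bigr\}.
% For the sieve complement, E\{ \int_{\Theta_n^c} R_n \, d\Pi \} \le \Pi(\Theta_n^c), so
E\bigl\{ \Pi_n(\Theta_n^c) \, 1_{E_n} \bigr\}
  \le e^{c n \varepsilon_n^2} \, \Pi(\Theta_n^c)
  \le e^{-(r - c) n \varepsilon_n^2}.
```

Both bounds vanish once $M^2 > 8(R + c/2)$, which is one way to read "if $M$ is large enough."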

The use of a max-Hellinger metric in Theorem 3 can be avoided in some cases, e.g., if
$H_n$ is equivalent to some fixed metric on $\Theta$-space. One specific example is
nonparametric regression using splines; see Ghosal and van der Vaart (2007a, Sec. 7.7).



7 Extension to Markov process models 
7.1 Setup and notation 

Let $(Y_n : n \geq 0)$ be an ergodic Markov process on $\mathbb{Y}$ with transition density
$f_\theta(y' \mid y)$ and stationary density $u_\theta(y)$, both with respect to a
$\sigma$-finite measure $\mu$ on $\mathbb{Y}$, and both indexed by $\theta \in \Theta$. That is,
the transition density $f_\theta$ characterizes the one-step moves $Y_n \to Y_{n+1}$ of the
process, and $u_\theta$ the limiting marginal distribution of $Y_n$. Here, like in
Ghosal and van der Vaart (2007a), we assume the process is at stationarity, i.e., that
$Y_0 \sim u_{\theta^*}$, so that all the marginal distributions are the same and equal to
$u_{\theta^*}$. The goal here is estimation of the unknown index $\theta$. Methods developed in
the previous sections, particularly in Section 6, shall be used to prove posterior convergence
rate theorems. Following the previous setup, let $\theta^*$ denote the "true" $\theta$ value.
Now define

\[ R_n(\theta) = \frac{u_\theta(Y_0)}{u_{\theta^*}(Y_0)}
   \prod_{i=1}^n \frac{f_\theta(Y_i \mid Y_{i-1})}{f_{\theta^*}(Y_i \mid Y_{i-1})} \]

as the likelihood ratio for $(Y_0, Y_1, \ldots, Y_n)$. For a prior distribution $\Pi$ on
$\Theta$ and a measurable set $A \subset \Theta$, Bayes theorem gives the posterior distribution
for $\theta$ as follows:

\[ \Pi_n(A) = \Pi(A \mid Y_0, \ldots, Y_n)
   = \frac{\int_A R_n(\theta) \, \Pi(d\theta)}{\int_\Theta R_n(\theta) \, \Pi(d\theta)}. \tag{14} \]

The primary goal of this section is to investigate the convergence of $\Pi_n(A_n)$, where
$A_n$ is the complement of a shrinking neighborhood of $\theta^*$. The notion of "neighborhood"
is more difficult here than in the previous cases; see Section 7.2 below.
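For a concrete instance of (14), consider a Gaussian AR(1) chain, $Y_i \mid Y_{i-1} \sim N(\theta Y_{i-1}, 1)$ with $|\theta| < 1$, whose stationary density is $N(0, 1/(1 - \theta^2))$. The sketch below simulates such a chain started at stationarity and evaluates the posterior over a small grid of candidate values. The model, the grid, the prior, and the function names are all hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def log_normal_pdf(y, mean, var):
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mean) ** 2 / var

def log_joint(path, theta):
    """log{ u_theta(Y_0) * prod_i f_theta(Y_i | Y_{i-1}) } for the Gaussian AR(1) chain."""
    stationary = log_normal_pdf(path[0], 0.0, 1.0 / (1.0 - theta ** 2))
    transitions = log_normal_pdf(path[1:], theta * path[:-1], 1.0)
    return stationary + np.sum(transitions)

def posterior_on_grid(path, thetas, prior, theta_true):
    """Posterior as in (14) for a prior supported on the finite grid `thetas`."""
    log_r = np.array([log_joint(path, t) - log_joint(path, theta_true) for t in thetas])
    log_post = np.log(prior) + log_r
    log_post -= log_post.max()          # stabilize before exponentiating
    w = np.exp(log_post)
    return w / w.sum()

rng = np.random.default_rng(1)
theta_true, n = 0.5, 500
path = np.empty(n + 1)
path[0] = rng.normal(0.0, np.sqrt(1.0 / (1.0 - theta_true ** 2)))   # Y_0 from the stationary law
for i in range(1, n + 1):
    path[i] = theta_true * path[i - 1] + rng.standard_normal()

thetas = np.array([-0.5, 0.0, 0.5, 0.9])     # hypothetical candidate values
prior = np.full(len(thetas), 0.25)
post = posterior_on_grid(path, thetas, prior, theta_true)
print(dict(zip(thetas.tolist(), np.round(post, 4).tolist())))
```

The only structural difference from the independent case is the extra stationary-density factor for $Y_0$ and the dependence of each transition term on the previous state.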

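Prior thickness. For concentration properties of the prior $\Pi$, consider

\[ K(\theta^*, \theta) = \int K_y(f_{\theta^*}, f_\theta) \, u_{\theta^*}(y) \, \mu(dy), \]
\[ V(\theta^*, \theta) = \int V_y(f_{\theta^*}, f_\theta) \, u_{\theta^*}(y) \, \mu(dy), \]

where $K_y(f_{\theta^*}, f_\theta) = K(f_{\theta^*}(\cdot \mid y), f_\theta(\cdot \mid y))$ is the
usual Kullback-Leibler divergence for densities; $V_y$ is defined similarly, for $V$ as in
Section 2.2. Let $\tilde\Theta$ be the set of all $\theta$'s such that both
$K(u_{\theta^*}, u_\theta)$ and $V(u_{\theta^*}, u_\theta)$ are bounded by 1. With this notation,
we say that the prior $\Pi$ is $\varepsilon_n$-thick at $\theta^*$ if, for some constant $C > 0$,

\[ \Pi(\{\theta \in \tilde\Theta : K(\theta^*, \theta) \leq \varepsilon_n^2, \;
   V(\theta^*, \theta) \leq \varepsilon_n^2\}) \geq e^{-C n \varepsilon_n^2}. \tag{15} \]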



Lemma 10 of Ghosal and van der Vaart (2007a) gives an analogue of Lemma 1
above for the present dependent data case. That is, for $C$ as in (15),

\[ P(I_n < e^{-c n \varepsilon_n^2}) \to 0 \quad \text{for any } c > C + 1, \tag{16} \]

where $I_n$ is the denominator in (14).



Separation. Let $H_y$ be the usual Hellinger distance on transition densities with fixed state
$y$, i.e., $H_y(f_{\theta^*}, f_\theta) = H(f_{\theta^*}(\cdot \mid y), f_\theta(\cdot \mid y))$.
Also let $h_y = H_y^2/2$. Now define the max-Hellinger (semi)metric
$H_\infty(\theta^*, \theta) = \sup_y H_y(f_{\theta^*}, f_\theta)$. We say that $\theta^*$ and a
set $A \subset \Theta$ are $\delta$-separated (with respect to $h_\infty = H_\infty^2/2$) if
$h_\infty(\theta^*, \theta) > \delta$ for all $\theta \in A$.



7.2 Convergence rate results 

To start, consider the predictive density problem of Section 3. In this case, the predictive
density is itself a transition density. In particular, we have

\[ \hat f_{i-1}(y \mid Y_{i-1}) = \int f_\theta(y \mid Y_{i-1}) \, \Pi_{i-1}(d\theta),
   \quad i = 1, \ldots, n, \]

the expected transition density with respect to the posterior distribution $\Pi_{i-1}$. This is
a typical Bayes estimate of the transition density, and the claim is that it converges to the
true transition density $f_{\theta^*}$ as $n \to \infty$. In particular, we have the following
convergence rate result for predictive densities.

Proposition 5. Given $(\varepsilon_n)$ with $\varepsilon_n \to 0$ and
$n\varepsilon_n^2 \to \infty$, let
$\mathcal{K}_n = \{\theta : K(\theta^*, \theta) \leq \varepsilon_n^2, \;
K(u_{\theta^*}, u_\theta) < \infty\}$. If $\log \Pi(\mathcal{K}_n) \geq -n\varepsilon_n^2$, then
$n^{-1} \sum_{i=1}^n E\{K_{Y_{i-1}}(f_{\theta^*}, \hat f_{i-1})\} \lesssim \varepsilon_n^2$.

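Proof. Let $\hat f_n$ denote the joint density for $(Y_0, \ldots, Y_n)$ under the Bayes model.
Then

\[ \hat f_n(Y_0, \ldots, Y_n)
   = \int u_\theta(Y_0) \prod_{i=1}^n f_\theta(Y_i \mid Y_{i-1}) \, \Pi(d\theta)
   = \hat u_0(Y_0) \prod_{i=1}^n \hat f_{i-1}(Y_i \mid Y_{i-1}), \]

just like in the proof of Proposition 4, where
$\hat u_0(Y_0) = \int u_\theta(Y_0) \, \Pi(d\theta)$. Another simple calculation shows that if
$f^n_{\theta^*}$ is the joint density of $(Y_0, Y_1, \ldots, Y_n)$ under $\theta^*$, then the
joint Kullback-Leibler divergence $K(f^n_{\theta^*}, \hat f_n)$ equals

\[ E\Bigl\{ \log \Bigl[ \frac{u_{\theta^*}(Y_0)}{\hat u_0(Y_0)}
   \prod_{i=1}^n \frac{f_{\theta^*}(Y_i \mid Y_{i-1})}{\hat f_{i-1}(Y_i \mid Y_{i-1})} \Bigr] \Bigr\}
   = K(u_{\theta^*}, \hat u_0) + \sum_{i=1}^n E\{K_{Y_{i-1}}(f_{\theta^*}, \hat f_{i-1})\}. \]

Observe that $n^{-1} K(u_{\theta^*}, \hat u_0) = O(\varepsilon_n^2)$ and, on $\mathcal{K}_n$,
$n^{-1} K(u_{\theta^*}, u_\theta) = O(\varepsilon_n^2)$. From here, the proof is just like that
of Proposition 1. □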
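The role of the stationarity assumption in this calculation can be made explicit. As an illustrative aside (a standard computation, written in the notation of this section, with $f^n_\theta$ denoting the joint density of $(Y_0, \ldots, Y_n)$ under $\theta$), the joint Kullback-Leibler divergence for a fixed $\theta$ splits as follows, because each $Y_{i-1}$ has marginal density $u_{\theta^*}$ under the true model:

```latex
K(f^n_{\theta^*}, f^n_\theta)
  = K(u_{\theta^*}, u_\theta)
    + \sum_{i=1}^n E\{ K_{Y_{i-1}}(f_{\theta^*}, f_\theta) \}
  = K(u_{\theta^*}, u_\theta)
    + n \int K_y(f_{\theta^*}, f_\theta) \, u_{\theta^*}(y) \, \mu(dy)
  = K(u_{\theta^*}, u_\theta) + n K(\theta^*, \theta).
```

In particular, for $\theta$ with $K(\theta^*, \theta) \leq \varepsilon_n^2$, the per-observation divergence $n^{-1} K(f^n_{\theta^*}, f^n_\theta)$ is at most $\varepsilon_n^2 + n^{-1} K(u_{\theta^*}, u_\theta)$, which is where the stationary-density term in the proof enters.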

As before, the assumption of Proposition 5 is implied by prior thickness at $\theta^*$. The
proposition also extends the result in Corollary 2.1 of Ghosal and Tang (2006). Indeed, by
Markov's inequality, our result implies
$n^{-1} \sum_{i=1}^n K_{Y_{i-1}}(f_{\theta^*}, \hat f_{i-1}) = O_P(\varepsilon_n^2)$, which is
stronger than the $o_P(1)$ obtained by these authors. Our version of the Kullback-Leibler
property is more strict than theirs, but this is typical when convergence rates are sought.

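Let $(A_n)$ be a sequence of measurable subsets of $\Theta$, and let
$L_{n,i} = \int_{A_n} R_i(\theta) \, \Pi(d\theta)$ be the numerator of the posterior probability
$\Pi_i(A_n)$ in (14). Then

\[ L_{n,i} / L_{n,i-1}
   = \hat f^{A_n}_{i-1}(Y_i \mid Y_{i-1}) / f_{\theta^*}(Y_i \mid Y_{i-1}),
   \quad i = 1, \ldots, n, \quad n \geq 1, \]

where $\hat f^{A_n}_{i-1}$ is the predictive transition density when the posterior $\Pi_{i-1}$
is restricted to $A_n$. We also have

\[ E\{(L_{n,i} / L_{n,i-1})^{1/2} \mid Y_0, \ldots, Y_{i-1}\}
   = 1 - h_{Y_{i-1}}(f_{\theta^*}, \hat f^{A_n}_{i-1}), \quad i = 1, \ldots, n. \tag{17} \]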

We can now present an extension of Lemma 3 for the case of Markov processes.

Lemma 6. Let $\Pi$ be $\varepsilon_n$-thick at $\theta^*$, with $C$ as in (15). If $A_n$ is
convex and $d\varepsilon_n^2$-separated from $\theta^*$, with $d > C + 1$, then
$E(L_{n,n}^{1/2}) \leq \Pi(A_n)^{1/2} e^{-d n \varepsilon_n^2}$.

Proof. Exactly the same as that of Lemma 3. □

The fact that data $Y_{i-1}$ appears as part of the formula for the distance $h_{Y_{i-1}}$ in
(17) necessitates the use of the max-Hellinger metric, i.e., separation with respect to
$h_\infty$ implies separation with respect to $h_y$ for any $y$, even if $y$ is random. But we
are free to formulate the convergence rate theorem with a different metric. Here we consider
$H_Q(\theta^*, \theta) = \int H_y(f_{\theta^*}, f_\theta) \, Q(dy)$, where $Q$ is a probability
measure on $\mathbb{Y}$. In the nonlinear Gaussian autoregression example in
Ghosal and van der Vaart (2007a, Sec. 7.4), the measure $Q$ is taken to be a two-point location
mixture of Gaussians.

Theorem 4. Let $\Pi$ be $\varepsilon_n$-thick at $\theta^*$ with constant $C$ as in (15).
Suppose there exists a sequence $(\Theta_n)$ such that
$\Pi(\Theta_n^c) \leq e^{-r n \varepsilon_n^2}$, where $r > C + 1$, and
$\log N(\varepsilon_n, \Theta_n, H_\infty) \leq n \varepsilon_n^2$.
Then $\Pi_n(\{\theta : H_Q(\theta^*, \theta) > \varepsilon_n\}) \to 0$ in probability.

Proof. The proof is similar to that for the independent non-iid case. In particular, we
cover the complement of a mean-Hellinger neighborhood of $\theta^*$ by max-Hellinger balls.
The convexity and separation calculations are analogous to those in Theorem 3, and the
rest of the argument goes just like in the proof of Theorem 1. □
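To see how the averaged distance $H_Q$ relates to the max-Hellinger distance, the following sketch evaluates both for a Gaussian autoregression $Y_i \mid Y_{i-1} = y \sim N(\theta(y), 1)$, for which $H_y^2 = 2\{1 - e^{-(\theta^*(y) - \theta(y))^2/8}\}$. The two regression functions, the mixture locations used for $Q$, and the quadrature grid are hypothetical choices made for illustration; they are not the ones used by Ghosal and van der Vaart (2007a).

```python
import numpy as np

def normal_pdf(y, mean, sd=1.0):
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def hellinger_y(y, reg_a, reg_b):
    """H_y between the N(reg_a(y), 1) and N(reg_b(y), 1) transition densities."""
    gap = reg_a(y) - reg_b(y)
    return np.sqrt(2.0 * (1.0 - np.exp(-gap ** 2 / 8.0)))

def integrate(values, grid):
    """Trapezoidal rule on the given grid."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

def reg_true(y):
    return np.tanh(y)            # hypothetical theta*

def reg_alt(y):
    return 0.8 * np.tanh(y)      # hypothetical theta

grid = np.linspace(-12.0, 12.0, 24001)
# Hypothetical Q: equal-weight mixture of N(-1, 1) and N(1, 1)
q_density = 0.5 * normal_pdf(grid, -1.0) + 0.5 * normal_pdf(grid, 1.0)

H_y = hellinger_y(grid, reg_true, reg_alt)
H_Q = integrate(H_y * q_density, grid)
H_inf_on_grid = float(H_y.max())   # grid approximation to sup_y H_y
print(H_Q, H_inf_on_grid)
```

Since $Q$ is a probability measure, $H_Q \leq H_\infty$ always holds, so neighborhoods defined through $H_Q$ are larger than those defined through the max-Hellinger distance.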



8 Discussion 

Here we have presented an analysis of Bayesian asymptotics based primarily on predictive 
densities. These densities are fundamental quantities in Bayesian statistics, for they are 
Bayes density estimates under a variety of loss functions. So, in this sense, our analysis 
has a stronger Bayesian flavor than other existing approaches. We have also demonstrated 
how our basic approach can be tuned to handle a variety of models — iid, mis-specified 
iid, independent non-iid, and dependent Markov processes. For example, essentially the 
same predictive density convergence rate result holds in all these contexts. 

We have opted here for simplicity of presentation rather than strength of results.
For example, one can easily tailor the analysis, taking more efficient choices of coverings,
etc., to achieve sharper rates. In particular, to achieve $n^{-1/2}$ rates in
finite-dimensional parametric models, a special type of covering is required (e.g.,
Ghosal et al. 2000, Theorem 2.4), and this can be incorporated into the present analysis. On
the other hand, if convergence of predictive densities is the only concern, then these special
coverings are not necessary; only local thickness of the prior is needed. Indeed, it is
straightforward to follow the argument in Ghosal and van der Vaart (2007a, Sec. 7.7) to show
that, in a nonparametric regression context, where the true regression function $\theta^*$ lies
in an $\alpha$-smooth function class, and a spline-based prior is used, the predictive
densities converge, in the sense of Proposition 4, at the minimax rate
$n^{-\alpha/(2\alpha+1)}$. In this example, however, Ghosal and van der Vaart's analysis gives
full convergence of the posterior at the same rate under basically the same assumptions. But
there may be some cases where the weaker conditions of the predictive density convergence
theorems may be more useful.

• Consider a basic iid Bayesian density estimation problem. For Dirichlet process
location-mixtures of Gaussians, care must be taken in choosing a prior for the
common component scale $\sigma$. This is like the choice of bandwidth in classical density
estimation. Typical conditions restrict the amount of mass the prior for $\sigma$ can place
near zero. However, these conditions are primarily needed for the control of model
entropy: when $\sigma$ is near zero, the class of possible models is enormous, making the
entropy large. But if convergence of predictive densities is the question of interest,
so that only local thickness is required, as in Proposition 1, then entropy is not a
concern. Therefore, one can expect practically weaker assumptions on the prior if
the focus is on convergence of predictive densities.

• For dependent data models, there may be some advantage to the predictive density-based
approach. Indeed, in Section 7, convergence of the predictive densities in
Proposition 5 follows without any assumptions on the mixing of the process. This
is due to the fact that only "first moment" conditions (bounds on the Kullback-Leibler
number) are needed. Compare this to the posterior convergence analysis in
Ghosal and van der Vaart (2007a, Sec. 4), which requires assumptions on the mixing
of the process and some "higher-than-second moment" conditions.

Finally, we mention that this investigation began by looking at a predictive density
analysis of the posterior by using a law of large numbers for martingale difference arrays.
Unfortunately, that approach seemed to be somewhat limited; specifically, a non-trivial
extension to a uniform martingale law of large numbers is needed. That idea is still
interesting (see Martin and Hong 2012), although the results here are stronger.



References 

Barron, A. (1987), "Are Bayes rules consistent in information?" in Open Problems in 
Communications and Computation, eds. Cover, T. M. and Gopinath, B., Springer- 
Verlag, New York, pp. 85-91. 

Barron, A., Schervish, M. J., and Wasserman, L. (1999), "The consistency of posterior 
distributions in nonparametric problems," Ann. Statist., 27, 536-561. 

Barron, A. R. (1999), "Information-theoretic characterization of Bayes performance and 
the choice of priors in parametric and nonparametric problems," in Bayesian statistics, 
6 (Alcoceber, 1998), New York: Oxford Univ. Press, pp. 27-52. 

Bhattacharya, A. and Dunson, D. B. (2010), "Nonparametric Bayesian density estimation 
on manifolds with applications to planar shapes," Biometrika, 97, 851-865. 

Choi, T. and Ramamoorthi, R. V. (2008), "Remarks on consistency of posterior
distributions," in Pushing the Limits of Contemporary Statistics: Contributions in Honor
of Jayanta K. Ghosh, Beachwood, OH: Inst. Math. Statist., vol. 3 of Inst. Math. Stat.
Collect., pp. 170-186.

Doob, J. L. (1949), "Application of the theory of martingales," in Le Calcul des
Probabilités et ses Applications, Paris: Centre National de la Recherche Scientifique,
Colloques Internationaux du Centre National de la Recherche Scientifique, no. 13, pp.
23-27.

Ghosal, S., Ghosh, J. K., and Ramamoorthi, R. V. (1999), "Posterior consistency of 
Dirichlet mixtures in density estimation," Ann. Statist., 27, 143-158. 

Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000), "Convergence rates of posterior 
distributions," Ann. Statist., 28, 500-531. 

Ghosal, S. and Tang, Y. (2006), "Bayesian consistency for Markov processes," Sankhya, 
68, 227-239. 

Ghosal, S. and van der Vaart, A. (2007a), "Convergence rates of posterior distributions 
for non-i.i.d. observations," Ann. Statist., 35, 192-223. 






Ghosal, S. and van der Vaart, A. W. (2001), "Entropies and rates of convergence for
maximum likelihood and Bayes estimation for mixtures of normal densities," Ann.
Statist., 29, 1233-1263.

Ghosal, S. and van der Vaart, A. W. (2007b), "Posterior convergence rates of Dirichlet
mixtures at smooth densities," Ann. Statist., 35, 697-723.

Ghosh, J. K. and Ramamoorthi, R. V. (2003), Bayesian Nonparametrics, New York:
Springer-Verlag.

Kleijn, B. J. K. and van der Vaart, A. W. (2006), "Misspecification in infinite-dimensional 
Bayesian statistics," Ann. Statist., 34, 837-877. 

Lian, H. (2009), "On rates of convergence for posterior distributions under
misspecification," Comm. Statist. Theory Methods, 38, 1893-1900.

Martin, R. and Hong, L. (2012), "A law of large numbers for martingale arrays with
applications in nonparametric estimation," Unpublished manuscript, arXiv:1201.3102.

Pati, D., Dunson, D. B., and Tokdar, S. T. (2011), "Posterior consistency in conditional 
distribution estimation," Unpublished manuscript. 

Schwartz, L. (1965), "On Bayes procedures," Z. Wahrs. verw. Geb., 4, 10-26. 

Shalizi, C. R. (2009), "Dynamics of Bayesian updating with dependent data and mis- 
specified models," Electron. J. Stat., 3, 1039-1074. 

Shen, X. and Wasserman, L. (2001), "Rates of convergence of posterior distributions,"
Ann. Statist., 29, 687-714.

Tokdar, S. T. (2006), "Posterior consistency of Dirichlet location-scale mixture of normals 
in density estimation and regression," Sankhya, 68, 90-110. 

Walker, S. (2003), "On sufficient conditions for Bayesian consistency," Biometrika, 90, 
482-488. 

Walker, S. (2004), "New approaches to Bayesian consistency," Ann. Statist., 32, 2028-2043.

Walker, S. G., Lijoi, A., and Prünster, I. (2007), "On rates of convergence for posterior
distributions in infinite-dimensional models," Ann. Statist., 35, 738-746.

Wu, Y. and Ghosal, S. (2008), "Kullback Leibler property of kernel mixture priors in 
Bayesian density estimation," Electron. J. Stat., 2, 298-331. 

Wu, Y. and Ghosal, S. (2010), "The $L_1$-consistency of Dirichlet mixtures in multivariate
Bayesian density estimation," J. Multivariate Anal., 101, 2411-2419.






